Linear Methods: review


\[ \begin{equation} \nonumber \hat{Y}= X\beta \end{equation} \]

The OLS estimators solve:

\[ \begin{equation} \nonumber \hat{\beta}^{OLS} = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^n (y_i - X_i\beta)^2 \end{equation} \]

Shrinkage methods solve:

\[ \begin{equation} \nonumber \hat{\beta} = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^n (y_i - X_i\beta)^2 + \lambda \left(\alpha\sum_{j=1}^p|\beta_j| + \frac{(1-\alpha)}{2} \sum_{j=1}^p\beta_j^2\right) \end{equation} \]

But what if…

What would you use?

…Tricky question….mumble mumble…

A simple line?

mmm..Does not seem to work..

What about..

\[ \begin{equation} \nonumber y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 \end{equation} \]

Is this still linear regression?


Yes, it can still be written as

\[ \begin{equation} \nonumber \hat{Y} = X\beta \end{equation} \]

Where

\[ X= \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \\ \end{bmatrix} \]

So… Any regression is linear if it is linear in its parameters.
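A quick way to convince yourself, sketched on simulated data (the variable names here are made up):

```r
# a quadratic mean function fitted with plain lm(): still linear in beta
set.seed(1)
x <- runif(100, -2, 2)
y <- 1 + 2*x - 0.5*x^2 + rnorm(100, sd = 0.2)

fit <- lm(y ~ x + I(x^2))   # design matrix [1, x, x^2], exactly as above
coef(fit)                   # estimates close to (1, 2, -0.5)
```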

A parabola?

plot(cbind(lidar$range, lidar$logratio), xlab="x", ylab="y", main="Lidar", pch=19, col="red")
pol <- lm(logratio ~ poly(range,2), data=lidar)
lines(cbind(lidar$range,predict(pol)), col="blue", lwd=2)

Still not very convincing…

A 3rd degree polynomial?

Maybe…

An 8th degree polynomial?

Maybe too much…

Which polynomial?

If we want to stick to polynomials how do we choose the degree?

This time the answer is the fifth!


Each method may work in specific circumstances, and you might want to have a look at each one before choosing the level of complexity of your curve.

But still… if you want high predictive power, use cross-validation (almost no assumptions required)!

Can we do better than polynomials?

  • Select k knots on the range of x (usually at the quantiles).
  • Fit a local polynomial between each pair of adjacent knots.

So we break the range of X into non-overlapping pieces and construct \(f_j(X) = I(o_j \le X \le t_j)\), a function that depends only on a given range of X.

A line with 4 knots

kn <- as.numeric( quantile(lidar$range, (1:4)/5) )

# prepare the matrix of covariates / explanatory variables
x <- matrix(0, dim(lidar)[1], length(kn)+1)
for(j in 1:length(kn)) {
  x[,j] <- pmax(lidar$range-kn[j], 0)
}
x[, length(kn)+1] <- lidar$range

# Fit the regression model
ppm <- lm(lidar$logratio ~ x)
plot(logratio~range, data=lidar, pch=19, col='red', cex=1, main="lidar", xlab="x", ylab="y")
lines(predict(ppm)[order(range)]~sort(range), data=lidar, lwd=2, col='blue')

Quadratic line with 4 knots

kn <- as.numeric( quantile(lidar$range, (1:4)/5) )

# prepare the matrix of covariates / explanatory variables
x <- matrix(0, dim(lidar)[1], length(kn)+2)
for(j in 1:length(kn)) {
  x[,j] <- pmax(lidar$range-kn[j], 0)^2 ## <<<---- Note: here you put the degree
}
x[, length(kn)+1] <- lidar$range
x[, length(kn)+2] <- lidar$range^2     ## <<------ Note: also here(see formula)
# Fit the regression model
ppm <- lm(lidar$logratio ~ x)
plot(logratio~range, data=lidar, pch=19, col='red', cex=1, main="lidar", xlab="x", ylab="y")
lines(predict(ppm)[order(range)]~sort(range), data=lidar, lwd=2, col='blue')

Now it is also differentiable…

Cubic fit with 60 knots

For a linear fit:

\[ \begin{equation} E[y_i|x_i] = \beta_0 + \beta_1 x_i + \sum_{j=1}^k \beta_{j+1} f_j(x_i) \end{equation} \]

Where

\[ \begin{equation} \nonumber f_j(x_i) = (x_i - k_j)_+ \end{equation} \]

\((x_i - k_j)_+ = x_i - k_j\) if positive, zero otherwise.
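This hinge is exactly what pmax() computes (it is used the same way in the code above), e.g. on toy values:

```r
x <- c(1, 2, 3, 4)
pmax(x - 2.5, 0)   # (x - k)+ with knot k = 2.5: 0.0 0.0 0.5 1.5
```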

For a quadratic fit

For a quadratic fit:

\[ \begin{equation} \nonumber E[y_i|x_i] = \beta_0 + \beta_1 x_i + \beta_2x_i^2 + \sum_{j=1}^k \beta_{j+2} f_j(x_i) \end{equation} \]

Where

\[ \begin{equation} \nonumber f_j(x_i) = (x_i - k_j)_+^2 \end{equation} \]

For a p degree fit

Generally for a p degree fit:

\[ \begin{equation} \nonumber E[y_i|x_i] = \beta_0 + \sum_{j=1}^p \beta_j x_i^j + \sum_{j=1}^k \beta_{j+p} f_j(x_i) \end{equation} \]

Where

\[ \begin{equation} f_j(x_i) = (x_i - k_j)_+^p \end{equation} \]

Issues

  • How many knots?
  • What degree?

We will solve this problem using natural cubic splines. A natural cubic spline is linear beyond the boundary knots and has a continuous second derivative.
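A minimal sketch with the ns() basis from the splines package, on simulated data (the data here are made up for illustration):

```r
library(splines)

set.seed(1)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

# natural cubic spline basis with 6 degrees of freedom
fit <- lm(y ~ ns(x, df = 6))
plot(x, y, col = "grey")
lines(x, fitted(fit), col = "blue", lwd = 2)
```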

What about using a tuning parameter?

\[ \begin{equation} \underset{f}{\operatorname{min}} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int (f'' (t))^2 dt \end{equation} \]

The solution is a natural cubic spline!


Again, we should find the best trade-off between complexity and variance in order to minimize the out-of-sample MSPE. How do we choose lambda?


… CV!

Which CV?

Whether you want to use k-fold CV or LOOCV may depend on several factors. In this specific case LOOCV is less computationally intensive: for a linear smoother we can simply use the following formula:

\[ \begin{equation} MSPE(\hat{f}) = \frac{1}{N}\sum_{i=1}^N \left(\frac{y_i - \hat{f}(x_i)}{1 - \operatorname{trace}(S_\lambda)/N}\right)^2 \end{equation} \]
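A sketch of computing this criterion by hand from a fitted smoother; it assumes x has no ties, and uses the fact that the df component of a smooth.spline fit equals \(\operatorname{trace}(S_\lambda)\):

```r
set.seed(1)
x <- sort(runif(200))
y <- sin(2*pi*x) + rnorm(200, sd = 0.2)

fit  <- smooth.spline(x, y, all.knots = TRUE)
yhat <- predict(fit, x)$y
n    <- length(y)

# generalized cross-validation criterion, no refitting required
mspe.cv <- mean(((y - yhat) / (1 - fit$df / n))^2)
```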

In R the function smooth.spline() (in the stats package) has this cross-validation built in.

Optimal fit with Smoothing Splines

tmp.cv <- smooth.spline(x=lidar$range, y=lidar$logratio, cv=TRUE,
                        all.knots=TRUE)
# optimal regularization parameter: tmp.cv$spar = 0.974


# compute the optimal fit
tmp <- smooth.spline(x=lidar$range, y=lidar$logratio, spar=tmp.cv$spar, cv=FALSE,
                     all.knots=TRUE)
plot(cbind(lidar$range, lidar$logratio), xlab="x", ylab="y", main="Lidar", pch=19, col="red")
lines(tmp, col="blue", lwd=2)

Problem

We are interested in predicting the nitrogen dioxide concentration using the weighted mean of distances to employment centers. To do so we have at our disposal real data collected in Boston.

Open the data set Boston and draw a scatter plot of the variable nox as a function of the variable dis. What do you notice?

first

  • Construct a linear regression and provide an interpretation of the coefficient
  • Construct polynomial fits of degree 2, 3 and 8
  • Select the polynomial fit that you think fits best
  • Check whether your answer is correct using cross-validation

then

  • Construct a linear spline with 5 knots
  • Construct a smoothing spline with the best predictive power
  • Construct natural cubic splines with 3, 5, 10 and 40 degrees of freedom and select the one that fits best.

Hint: the skeleton of the code is provided for this exercise.

Solution

data(Boston, package="MASS")
plot(cbind(Boston$dis, Boston$nox), xlab="Distance from Emp.Center", ylab= "Nitrogen Dioxide", main ="Boston Data", col="green", cex=0.7, pch=19)

reg1 <- lm(nox~dis, data= Boston )
summary(reg1)
## 
## Call:
## lm(formula = nox ~ dis, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12239 -0.05212 -0.01257  0.04391  0.23041 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.715343   0.006796  105.26   <2e-16 ***
## dis         -0.042331   0.001566  -27.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07412 on 504 degrees of freedom
## Multiple R-squared:  0.5917, Adjusted R-squared:  0.5909 
## F-statistic: 730.4 on 1 and 504 DF,  p-value: < 2.2e-16

Polynomial fit

s3 <- list()
for (i in c(2,3,8)) {
  s3[[i]] <- lm(nox ~ poly(dis, i), data=Boston)
}

summary(s3[[2]])
## 
## Call:
## lm(formula = nox ~ poly(dis, i), data = Boston)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.129559 -0.044514 -0.007753  0.025778  0.201882 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.554695   0.002828  196.16   <2e-16 ***
## poly(dis, i)1 -2.003096   0.063610  -31.49   <2e-16 ***
## poly(dis, i)2  0.856330   0.063610   13.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06361 on 503 degrees of freedom
## Multiple R-squared:  0.6999, Adjusted R-squared:  0.6987 
## F-statistic: 586.4 on 2 and 503 DF,  p-value: < 2.2e-16
summary(s3[[3]])
## 
## Call:
## lm(formula = nox ~ poly(dis, i), data = Boston)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.121130 -0.040619 -0.009738  0.023385  0.194904 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.554695   0.002759 201.021  < 2e-16 ***
## poly(dis, i)1 -2.003096   0.062071 -32.271  < 2e-16 ***
## poly(dis, i)2  0.856330   0.062071  13.796  < 2e-16 ***
## poly(dis, i)3 -0.318049   0.062071  -5.124 4.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared:  0.7148, Adjusted R-squared:  0.7131 
## F-statistic: 419.3 on 3 and 502 DF,  p-value: < 2.2e-16
summary(s3[[8]])
## 
## Call:
## lm(formula = nox ~ poly(dis, i), data = Boston)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.132328 -0.037459 -0.008615  0.023667  0.197200 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.554695   0.002702 205.312  < 2e-16 ***
## poly(dis, i)1 -2.003096   0.060774 -32.960  < 2e-16 ***
## poly(dis, i)2  0.856330   0.060774  14.091  < 2e-16 ***
## poly(dis, i)3 -0.318049   0.060774  -5.233 2.46e-07 ***
## poly(dis, i)4  0.033547   0.060774   0.552  0.58120    
## poly(dis, i)5  0.133009   0.060774   2.189  0.02909 *  
## poly(dis, i)6 -0.192439   0.060774  -3.166  0.00164 ** 
## poly(dis, i)7  0.169628   0.060774   2.791  0.00545 ** 
## poly(dis, i)8 -0.117703   0.060774  -1.937  0.05334 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06077 on 497 degrees of freedom
## Multiple R-squared:  0.7293, Adjusted R-squared:  0.7249 
## F-statistic: 167.4 on 8 and 497 DF,  p-value: < 2.2e-16
plot(cbind(Boston$dis, Boston$nox), xlab="Distance from Emp.Center", ylab= "Nitrogen Dioxide", main ="Boston Data", col="green", cex=0.7,
     pch=19)

lines(predict(s3[[2]])[order(dis)]~sort(dis), data=Boston, lwd=2, col='blue', lty="dotted")
lines(predict(s3[[3]])[order(dis)]~sort(dis), data=Boston, col="red", lwd=2, lty="dotted")
lines(predict(s3[[8]])[order(dis)]~sort(dis), data=Boston, col="black", lwd=2, lty="dotted")

legend("topright", legend=c("2 degree", "3 degree", "8 degree"),
       cex=0.7, lwd=1, col=c("blue", "red", "black"))

Find the best with CV

set.seed(123)

n <- length(Boston$nox)
k <- 5

ii <- sample(rep(1:k, length= n))

pr.2 <- pr.3 <- pr.8 <- rep(NA, length=n)
for (j in 1:k){
  hold <- (ii == j)
  train <- (ii != j)
  xx.tr <- Boston[train,]

  xx.te <- Boston[hold,]
  
  pol2 <- lm(nox ~ poly(dis,2), data=xx.tr)
  pol3 <- lm(nox ~ poly(dis,3), data=xx.tr)
  pol8 <- lm(nox ~ poly(dis,8), data=xx.tr) ## <-- Remember to put data = xx.tr

  pr.2[hold] <- predict(pol2, newdata=xx.te)
  pr.3[hold] <- predict(pol3, newdata=xx.te)
  pr.8[hold] <- predict(pol8, newdata=xx.te)
}

mspe2degree <- mean((pr.2-Boston$nox)^2)
mspe3degree <- mean((pr.3-Boston$nox)^2)
mspe8degree <- mean((pr.8-Boston$nox)^2)

print(cbind(mspe2degree, mspe3degree, mspe8degree))
##      mspe2degree mspe3degree mspe8degree
## [1,] 0.004139044 0.003882791  0.01528747

Second Part: Smoothing Spline

## Smoothing spline with cross-validation

tmp.cv <- smooth.spline(x=Boston$dis, y=Boston$nox, cv=TRUE,
                        all.knots=TRUE)


# compute the optimal fit
tmp <- smooth.spline(x=Boston$dis, y=Boston$nox, spar=tmp.cv$spar, cv=FALSE,
                     all.knots=TRUE)

plot(cbind(Boston$dis, Boston$nox), xlab="Distance from Emp.Center", ylab= "Nitrogen Dioxide", main ="Boston Data", col="green", cex=0.7,
     pch=19)
lines(tmp, col="blue", lwd=2)

Compute MSPE

## Construct cubic splines and compare with cross validation

library(splines)
n <- nrow(Boston)
set.seed(123)

k <- 5

ii <- sample(rep(1:k, length= n))

pr.3.ns  <- pr.5.ns <- pr.10.ns <-  pr.40.ns  <- rep(0, n)
for(j in 1:k) {
  
  ## Construct the splines (note ns() creates a basis expansion)
  
  reg3.ns  <- lm(nox ~ ns(dis, df=3),  data=Boston[ii != j, ])
  reg5.ns  <- lm(nox ~ ns(dis, df=5),  data=Boston[ii != j, ])
  reg10.ns <- lm(nox ~ ns(dis, df=10), data=Boston[ii != j, ])
  reg40.ns <- lm(nox ~ ns(dis, df=40), data=Boston[ii != j, ])
  
  ## Make predictions

  pr.3.ns[ ii == j ] <- predict(reg3.ns, newdata=Boston[ii==j,])

  pr.5.ns[ ii == j ] <- predict(reg5.ns , newdata=Boston[ii==j,])

  pr.10.ns[ ii == j ] <- predict(reg10.ns, newdata=Boston[ii==j,])
  pr.40.ns[ ii == j ] <- predict(reg40.ns, newdata=Boston[ii==j,])
}

## Compute the MSPE

mspe3.ns <- mean( (Boston$nox - pr.3.ns)^2 )
mspe5.ns <- mean( (Boston$nox- pr.5.ns)^2 )
mspe10.ns <- mean( (Boston$nox - pr.10.ns)^2 )
mspe40.ns <- mean( (Boston$nox - pr.40.ns)^2 )

mspe.ns <- c(mspe3.ns,mspe5.ns,mspe10.ns,mspe40.ns)
## Look at it
print(mspe.ns)
## [1] 0.003876472 0.003761332 0.003731666 0.004123112
## Give an even better look 

plot(c(3,5,10,40),mspe.ns, xlab="degrees of freedom", ylab="mspe", cex=1, pch=19, col="purple", main="Natural Splines")
lines(cbind(c(3,5,10,40),mspe.ns), lty="dotted", col="purple", lwd=1)

Kernel Smoothers

What if I predict the value of y using the mean of \(y_i\) at \(x_i\)?

\[ \begin{equation} \nonumber \hat{y} = average(y_i : x_i = x) \end{equation} \]

Or in a certain bandwidth?

\[ \begin{equation} \nonumber \hat{y} = average(y_i : x - 0.5 < x_i < x + 0.5) \end{equation} \]

For predicting values with dis = 2 and dis = 8, we get:

If we want to use the local averages:

\[ \begin{equation} \nonumber \hat{f}(x) =\frac{1}{N_x} \sum_{i=1}^n y_i I(|x_i - x| < h) \end{equation} \]

Where h is the bandwidth (in the previous example it was 0.5), \(N_x\) is the number of observations within the bandwidth, and \(I(\cdot)\) is an indicator function equal to one if the expression inside is true and zero otherwise.
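A hand-rolled version of this estimator at a single point (a sketch; local_avg is a made-up helper name):

```r
# local average of y over the band |x_i - x0| < h
local_avg <- function(x0, x, y, h) {
  in_band <- abs(x - x0) < h
  if (!any(in_band)) return(NA)   # empty bandwidth
  mean(y[in_band])
}

# toy check: points at x = 1..5, y = x^2, band of half-width 1.1 around 3
local_avg(3, 1:5, (1:5)^2, 1.1)   # mean of y at x = 2, 3, 4, i.e. (4+9+16)/3
```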


The same formula can be written as:

\[ \begin{equation} \nonumber \hat{f}(x) =\frac{\sum_{i=1}^n y_i I(|x_i - x| < h)}{\sum_{i=1}^n I(|x_i - x| < h)} \end{equation} \]

And without loss of generality:

\[ \begin{equation} \nonumber \hat{f}(x) =\frac{\sum_{i=1}^n y_i K_h(x_i, x)}{\sum_{i=1}^n K_h(x_i, x)} \end{equation} \]

With

\[ \begin{equation} \nonumber K_h(x_i, x) = W(\small{|x_i - x|/h}) \end{equation} \]

\(W(z) = 1\) if \(z \le 1\), 0 otherwise.

KSM with 0.2 bandwidth

## h = 0.2

ksm <- ksmooth(x=Boston$dis, y=Boston$nox, kernel="box", bandwidth=0.20)
plot(cbind(Boston$dis, Boston$nox), xlab="Distance from Emp.Center", ylab= "Nitrogen Dioxide", main ="Boston Data", col="green", cex=0.7,
     pch=19)
lines(ksm, lwd=2, col="black")

What if we want a continuous function:

  • The only possibility is to have more observations
  • We cannot have the certainty of having a continuous function with KSM
  • We square the component z
  • We increase the bandwidth
  • The last two

KSM with 1 bandwidth

ksm <- ksmooth(x=Boston$dis, y=Boston$nox, kernel="box", bandwidth=1)
plot(cbind(Boston$dis, Boston$nox), xlab="Distance from Emp.Center", ylab= "Nitrogen Dioxide", main ="Boston Data", col="green", cex=0.7,
     pch=19)
lines(ksm, lwd=2, col="black")

KSM with normal kernel

ksm <- ksmooth(x=Boston$dis, y=Boston$nox, kernel="normal", bandwidth=0.5)
plot(cbind(Boston$dis, Boston$nox), xlab="Distance from Emp.Center", ylab= "Nitrogen Dioxide", main ="Boston Data", col="green", cex=0.7,
     pch=19)
lines(ksm, lwd=2, col="black")

The intuition:

Let’s go back to:

\[ \begin{equation} \nonumber K_h(x_i, x) = W(\small{|x_i - x|/h}) \end{equation} \]

Now:

\(W(z) = 1 - z^2\) if \(z \le 1\), 0 otherwise.

What is the intuition behind this?


We can assign different weights according to the distance of each observation from the point of interest \(x\).
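The weighted version, sketched with the quadratic weight above (nw_quad is a made-up helper name):

```r
# kernel-weighted average with W(z) = 1 - z^2 for z <= 1, 0 otherwise
nw_quad <- function(x0, x, y, h) {
  z <- abs(x - x0) / h
  w <- ifelse(z <= 1, 1 - z^2, 0)   # nearby points get larger weights
  sum(w * y) / sum(w)
}
```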


Local fit

What we analyzed:

\[ \begin{equation} \nonumber \hat{f}(x) =\frac{\sum_{i=1}^n y_i K_h(x_i, x)}{\sum_{i=1}^n K_h(x_i, x)} \end{equation} \]

It corresponds to local averages. So, it is equivalent to:

\[ \begin{equation} \nonumber \hat{f}(x) = \underset{\mu}{\operatorname{argmin}}\sum_{i=1}^n (y_i-\mu)^2 K_h(x_i, x) \end{equation} \]

Where the solution \(\mu\) is the weighted average within the bandwidth around each \(x\).

An alternative way to compute local averages is through:

\[ \begin{equation} \nonumber \hat{\beta}(x) = \underset{\beta_0, \beta_1}{\operatorname{argmin}}\sum_{i=1}^n (y_i-\beta_0 - \beta_1(x_i - x))^2 K_h(x_i, x) \end{equation} \]

With \(\beta_1 = 0\) (a local constant fit) this reduces to:

\[ \begin{equation} \nonumber \hat{f}(x) = \underset{\mu}{\operatorname{argmin}}\sum_{i=1}^n (y_i-\mu)^2 K_h(x_i, x) \end{equation} \]

Remember in fact the Taylor expansion:

\[ f(x) = f(x_0) + f'(x_0) (x - x_0) + \frac{1}{2}f''(x_0) (x-x_0)^2 + R \]

But then we could also add a quadratic term:

\[ \begin{equation} \nonumber \hat{\beta}(x) = \underset{\beta_0, \beta_1, \beta_2}{\operatorname{argmin}}\sum_{i=1}^n (y_i-\beta_0 - \beta_1(x_i - x) - \beta_2(x_i - x)^2)^2 K_h(x_i, x) \end{equation} \]

Or more generally:

\[ \begin{equation} \nonumber \hat{\beta}(x) = \underset{\beta_0, \dots, \beta_k}{\operatorname{argmin}}\sum_{i=1}^n (y_i-\beta_0 - \beta_1(x_i - x) - ... - \beta_k (x_i - x)^k)^2 K_h(x_i, x) \end{equation} \]

Note that the higher the degree, the lower the bias, but the higher the variance!

By assuming \(\hat{f}(x) = X\beta\) an alternative expression is:

\[ \begin{equation} \nonumber \hat{\beta}(x) = \underset{\beta_0, \beta_1}{\operatorname{argmin}}\sum_{i=1}^n (y_i-\beta_0 - \beta_1x_i)^2 K_h(x_i, x) \end{equation} \]

Isn’t this still a local fit?


For each point we fit a linear regression, weighting the observations differently.

A compact solution is through weighted least squares:

\[ \hat{\beta}(\textbf{x}) =(X' W_\textbf{x} X)^{-1} (X'W_\textbf{x}Y), \forall \textbf{x} \]

\[ W_\textbf{x}= \begin{bmatrix} K_h(x_1,x) & 0 & \dots & 0 \\ 0 & K_h(x_2,x) & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & K_h(x_n,x) \\ \end{bmatrix} \]
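One point of this weighted least squares problem can be solved directly with lm() and its weights argument (a sketch; local_linear is a made-up name, and the quadratic kernel from before is reused):

```r
# local linear fit at a single point x0; the intercept is the fitted value
local_linear <- function(x0, x, y, h) {
  z <- abs(x - x0) / h
  w <- ifelse(z <= 1, 1 - z^2, 0)           # quadratic kernel weights
  fit <- lm(y ~ I(x - x0), weights = w)     # centring makes beta0 = f(x0)
  unname(coef(fit)[1])
}
```

Zero-weight observations are simply dropped by lm(), so this matches summing only over the points inside the bandwidth.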

Is it a linear predictor?


It can still be written as

\[ \begin{equation} \nonumber \hat{f}(x) = S_h Y \end{equation} \]

Where \(S_h\) does not depend on Y and \(\hat{f}(x)\) is a weighted least squares estimator. Why do you think this property is important? …LOOCV for free?
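This is the "LOOCV for free" trick for any linear smoother: the leave-one-out residual is \(e_i/(1 - S_{ii})\). A sketch using smooth.spline, whose lev component holds the diagonal of the smoother matrix (again assuming no ties in x):

```r
set.seed(1)
x <- sort(runif(200))
y <- sin(2*pi*x) + rnorm(200, sd = 0.2)

fit <- smooth.spline(x, y, all.knots = TRUE)
res <- y - predict(fit, x)$y

# exact LOOCV error without refitting the model n times
loocv.mspe <- mean((res / (1 - fit$lev))^2)
```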

Issues:

  • Extension of the bandwidth
  • Degree of the polynomial

Examples

lp <- loess(nox  ~ dis, data= Boston, span=0.50, degree=1, family="gaussian")
plot(cbind(Boston$dis, Boston$nox), xlab="Distance from Emp.Center", ylab= "Nitrogen Dioxide", main ="Boston Data span = 0.5", col="green", cex=0.7,
     pch=19)
lines(predict(lp)[order(dis)]~sort(dis), data=Boston, lwd=2, col='blue')

lp <- loess(nox  ~ dis, data= Boston, span=0.01, degree=1, family="gaussian")
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 5.7321
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.0112
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 0.00021904
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : zero-width neighborhood. make span bigger
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : k-d tree limited by memory. ncmax= 506
plot(cbind(Boston$dis, Boston$nox), xlab="Distance from Emp.Center", ylab= "Nitrogen Dioxide", main ="Boston Data span = 0.01", col="green", cex=0.7,
     pch=19)
lines(predict(lp)[order(dis)]~sort(dis), data=Boston, lwd=2, col='blue')

R is complaining!!

lp <- loess(nox  ~ dis, data= Boston, span=2, degree=2, family="gaussian")
plot(cbind(Boston$dis, Boston$nox), xlab="Distance from Emp.Center", ylab= "Nitrogen Dioxide", main ="Boston Data s = 2 d = 2", col="green", cex=0.7,
     pch=19)
lines(predict(lp)[order(dis)]~sort(dis), data=Boston, lwd=2, col='blue')

Be careful…

Do you think local regression works well in multivariate cases?


Suppose that you have some observations uniformly distributed between 0 and 1. You want to use a kernel smoother with a bandwidth of 0.5 and no covariates (only the dependent variable): how many dimensions do you have? What is your best estimator?

Now suppose that you use one single covariate (the two-dimensional case) with the same bandwidth of 0.5. Both y and x are uniformly distributed between 0 and 1. Do you expect to have the same number of observations in each bandwidth?



As the number of dimensions increases, the number of observations within each bandwidth decreases EXPONENTIALLY. This may hugely affect the predictive power of local regressions with either many variables or few observations.
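A small simulation of this effect (frac_in_band is a made-up helper): the share of uniform points falling inside a band of half-width 0.25 around the centre shrinks quickly as dimensions are added.

```r
set.seed(1)

# fraction of n uniform points on [0,1]^d within half-width h of the centre
frac_in_band <- function(d, h = 0.25, n = 10000) {
  X <- matrix(runif(n * d), n, d)
  mean(apply(abs(X - 0.5) < h, 1, all))   # all coordinates in the band
}

sapply(1:5, frac_in_band)   # roughly 0.5, 0.25, 0.125, ... halving per dimension
```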

Exercise

We are interested in predicting the number of fatalities controlling for the taxation of beer. The only objective is to have a strong predictor of the fatality rate. To do so, we have a data set with 336 observations and several social and geographical indicators. Load the data set fatalities and draw a scatter plot of fatalities as a function of the beer tax. What do you notice? What kind of regression would you use?

  • Construct a linear regression of fatalities as a function of the beer tax and write down the interpretation of the coefficients
  • Construct a quadratic polynomial (use simple linear regression!) and report the values and the interpretation of the coefficients
  • Now construct a kernel smoother with bandwidth = 0.10 and a box kernel. What do you notice? (Hint: use the package KernSmooth)
  • Construct a new smoother with the same bandwidth but a normal kernel.
  • Construct a smoother with a box kernel and a bandwidth of 0.5. Compare the three results graphically
  • Select the smoother with normal kernel and smallest MSPE, choosing among bandwidths of 0.1, 0.3, 0.5 and 1. (Hint: skeleton of the code is provided)

then

  • Construct local polynomials using the function loess()
  • Find the best span with degree 2 using cross-validation
  • Compare the result with the previous kernel smoother
  • Plot the results

Optional Challenge

  • Construct a smoothing spline of beertax with the best lambda (use LOOCV)
  • Construct a lasso regression and find 1 variable strongly correlated with fatal (use the variable with the highest coefficient)
  • Construct a local polynomial of degree 2 using this variable. Select the span with cross-validation
  • Compute the MSPE of a local polynomial model with span 0.5, degree 2, and the variable most highly correlated with fatal selected through the lasso regression. (Be careful: how do you cross-validate?)

Hint: skeleton of the code is provided.

Solution

load("./data/fatalities.rda")
plot( fatal ~ beertax, data=fatalities, main="Fatalities Rate", col="purple")
reg1 <- lm(fatal ~ beertax, data=fatalities)
abline(reg1)

summary(reg1)
## 
## Call:
## lm(formula = fatal ~ beertax, data = fatalities)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -851.4 -634.5 -230.0  137.2 4616.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   878.35      74.86  11.733   <2e-16 ***
## beertax        98.03     106.82   0.918    0.359    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 934.3 on 334 degrees of freedom
## Multiple R-squared:  0.002515,   Adjusted R-squared:  -0.0004712 
## F-statistic: 0.8422 on 1 and 334 DF,  p-value: 0.3594
reg2 <- lm(fatal ~ poly(beertax,2), data=fatalities)
summary(reg2)
## 
## Call:
## lm(formula = fatal ~ poly(beertax, 2), data = fatalities)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -903.5 -573.7 -244.6  121.6 4494.3 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         928.66      50.68  18.323   <2e-16 ***
## poly(beertax, 2)1   857.41     929.04   0.923   0.3567    
## poly(beertax, 2)2  2030.04     929.04   2.185   0.0296 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 929 on 333 degrees of freedom
## Multiple R-squared:  0.01662,    Adjusted R-squared:  0.01071 
## F-statistic: 2.813 on 2 and 333 DF,  p-value: 0.06144
library(KernSmooth)
ksm <- ksmooth(x=fatalities$beertax, y=fatalities$fatal, kernel="box", bandwidth=0.10)
plot( fatal ~ beertax, data=fatalities, main="Fatalities Rate", col="purple")
lines(ksm, lwd=2, col="black")

ksm <- ksmooth(x=fatalities$beertax, y=fatalities$fatal, kernel="box", bandwidth=0.50)
plot( fatal ~ beertax, data=fatalities, main="Fatalities Rate", col="purple")
lines(ksm, lwd=2, col="black")

ksm <- ksmooth(x=fatalities$beertax, y=fatalities$fatal, kernel="normal", bandwidth=0.10)
plot( fatal ~ beertax, data=fatalities, main="Fatalities Rate", col="purple")
lines(ksm, lwd=2, col="black")

Select bandwidth with CV

set.seed(123)

n <- length(fatalities$fatal)
k <- 20

ii <- sample(rep(1:k, length= n))
mspepol <- mspe01 <- mspe03 <- mspe05 <- mspe1 <- rep(NA, length = k)

for (j in 1:k){
  hold <- (ii == j)
  train <- (ii != j)
  xx.tr <- fatalities[train,]
  
  xx.te <- fatalities[hold,]
  
  pol2 <- lm(fatal ~ poly(beertax,2), data=xx.tr)
  
  ksm0.1 <- ksmooth(x=xx.tr$beertax, y=xx.tr$fatal, kernel="normal", bandwidth=0.10)
  ksm0.3 <- ksmooth(x=xx.tr$beertax, y=xx.tr$fatal, kernel="normal", bandwidth=0.30)
  ksm0.5 <- ksmooth(x=xx.tr$beertax, y=xx.tr$fatal, kernel="normal", bandwidth=0.50)
  ksm1 <- ksmooth(x=xx.tr$beertax, y=xx.tr$fatal, kernel="normal", bandwidth=1)
  
  
  
  prpol      <- predict(pol2, newdata=xx.te)
  
  ## predict does not work with ksmooth; here is a possible approximation
  pr.01 <- pr.03 <- pr.05 <- pr.1 <- NULL
  for (i in xx.te$beertax){ 
  pr.01       <- append(pr.01, ksm0.1$y[which.min(abs(ksm0.1$x - i))])
  pr.03       <- append(pr.03,ksm0.3$y[which.min(abs(ksm0.3$x - i))])
  pr.05       <- append(pr.05,ksm0.5$y[which.min(abs(ksm0.5$x - i))])
  pr.1        <- append(pr.1,ksm1$y[which.min(abs(ksm1$x - i))])
  }
  
  mspepol[j] <- mean((xx.te$fatal - prpol)^2)
  mspe01[j] <- mean((xx.te$fatal - pr.01)^2)
  mspe03[j] <- mean((xx.te$fatal - pr.03)^2)
  mspe05[j] <- mean((xx.te$fatal - pr.05)^2)
  mspe1[j] <- mean((xx.te$fatal - pr.1)^2)
  
}

mspe <- cbind(mspepol, mspe01, mspe03, mspe05, mspe1)
colMeans(mspe)
##  mspepol   mspe01   mspe03   mspe05    mspe1 
## 869388.8 685685.6 791856.2 819653.5 859488.6

Exercise 2: Local Polynomial

library(MASS)
fatalities <- na.omit(fatalities)
# Best loess


n <- length(fatalities$fatal)
k <- 10
mspe.final <- NULL
ii <- (1:n) %% k + 1
set.seed(17)

for (j in 1:k){
  hold <- (ii == j)
  train <- (ii != j)
  xx.tr <- fatalities[train,]
  xx.te <- fatalities[hold,]
  
 mspe <- NULL
  for (i in seq(from=0.1,to=2,by=0.1)){
    lp <- loess(fatal  ~ beertax, data= xx.tr, span=i, degree=2, family="gaussian",control=loess.control(surface="direct")) 
    pr <- predict(lp, newdata=xx.te)
    mspe <- append(mspe, mean((xx.te$fatal - pr)^2))
  }
 mspe.final <- cbind(mspe.final, mspe)
}


plot(cbind(seq(from=0.1,to=2,by=0.1), rowMeans(mspe.final)), type="n", main="fatal(beertax)", xlab="span", ylab="MSPE")
lines(cbind(seq(from=0.1,to=2,by=0.1), rowMeans(mspe.final)) ,col="blue", lwd=2)

Plot the best loess

plot( fatal ~ beertax, data=fatalities, main="Fatalities Rate", col="purple")
ppm <- loess(fatal  ~ beertax, data= fatalities, span=0.1, degree=2, family="gaussian",control=loess.control(surface="direct")) 
lines(predict(ppm)[order(beertax)]~sort(beertax), data=fatalities, lwd=2, col='blue')

Optional Challenge

## Smoothing Spline using LOOCV
fatalities <- na.omit(fatalities)

tmp.cv <- smooth.spline(x= fatalities$beertax, y=fatalities$fatal, cv=TRUE,
                        all.knots=TRUE)
# compute the optimal fit
tmp <- smooth.spline(x= fatalities$beertax, y=fatalities$fatal, cv=FALSE,
                     all.knots=TRUE, spar=tmp.cv$spar)
plot( fatal ~ beertax, data=fatalities, main="Fatalities Rate", col="purple")
lines(tmp, col="blue", lwd=2)

*Close to the smoother…*

Build a lasso

## Build a lasso regression

library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
lambdas.rr <- exp(seq(-10, 2, length=50))
xx <- fatalities[,-c(17,18,19,20,21,22,23,24,25,26)]
xx <- model.matrix(~., data=xx)[,-1] ## <- Use this as covariate matrix
y <- fatalities[,17]
k <- 10


p <- ncol(xx)
n <- nrow(xx)
xx <- scale(scale(xx), center=FALSE, scale=rep(sqrt(n-1), p))


lr.cv <- cv.glmnet(x=xx, y=y, lambda=lambdas.rr, nfolds=k, alpha=1)
plot(lr.cv)

lr.final <- glmnet(x=as.matrix(xx), y=as.vector(y), lambda=lr.cv$lambda.1se, alpha=1)
rownames(lr.final$beta)[which.max(abs(as.vector(lr.final$beta)))]
## [1] "milestot"

Select

n <- length(fatalities$fatal)
k <- 10
mspe.final <- NULL
ii <- (1:n) %% k + 1
set.seed(17)

for (j in 1:k){
  hold <- (ii == j)
  train <- (ii != j)
  xx.tr <- fatalities[train,]
  xx.te <- fatalities[hold,]
  
  mspe <- NULL
  for (i in seq(from=0.1,to=2,by=0.1)){
    lp <- loess(fatal  ~ milestot, data= xx.tr, span=i, degree=2, family="gaussian",control=loess.control(surface="direct")) 
    pr <- predict(lp, newdata=xx.te)
    mspe <- append(mspe, mean((xx.te$fatal - pr)^2))
  }
  mspe.final <- cbind(mspe.final, mspe)
}

rowMeans(mspe.final)
##  [1] 35722.60 36314.22 36471.02 36529.62 36497.40 36427.97 36467.93
##  [8] 36532.42 37108.30 39568.14 39675.23 39811.85 39942.90 40057.44
## [15] 40154.87 40237.26 40307.06 40366.52 40417.50 40461.51
## Best span: 0.1 

Find MSPE

n <- length(fatalities$fatal)
k <- 20
mspe.final <- NULL
ii <- (1:n) %% k + 1
set.seed(17)
xx <- fatalities[,-c(1,2,14,15,16,17,18,19,20,21,22,23,24,25,26)]
xx <- model.matrix(~., data=xx)[,-1] ## <- Use this as covariate matrix
y <- fatalities[,17]
pr <- rep(0, length= n)
for (j in 1:k){
  hold <- (ii == j)
  train <- (ii != j)
  xx.tr <- xx[train,]
  y.tr <- y[train]
  xx.te <- xx[hold,]
  y.te  <- y[hold]
  p <- ncol(xx.tr)
  n <- nrow(xx.tr)
  xxm <- scale(scale(xx.tr), center=FALSE, scale=rep(sqrt(n-1), p))
  
  
  lr.cv <- cv.glmnet(x=xxm, y=y.tr, lambda=lambdas.rr, nfolds=k, alpha=1)

  
  ## Find the variable
  lr.final <- glmnet(x=as.matrix(xxm), y=as.vector(y.tr), lambda=lr.cv$lambda.min, alpha=1)
  
  ## Use the variable to build your new data frame
  
  dat.tr <- xx.tr[,which.max(abs(as.vector(lr.final$beta)))]
  dat.te <- xx.te[,which.max(abs(as.vector(lr.final$beta)))]
  dd.tr <- as.data.frame(cbind(dat.tr, y.tr))
  dd.te <- as.data.frame(cbind(dat.te, y.te))
  
  ## Assign names to the data frame
  names(dd.tr) <- names(dd.te) <- c("x", "y")
  
  ## Construct a polynomial and predict
  lp <- loess(y  ~ x, data=dd.tr, span=0.1, degree=2, family="gaussian",control=loess.control(surface="direct")) 
  pr[hold] <- predict(lp, newdata=dd.te)
  
}


mean((y-pr)^2)
## [1] 36263.98